if (!requireNamespace("Seurat", quietly = TRUE))
install.packages("Seurat")
if (!requireNamespace("tidyverse", quietly = TRUE))
install.packages("tidyverse")
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
if (!requireNamespace("remotes", quietly = TRUE))
install.packages("remotes")
if (!requireNamespace("decoupleR", quietly = TRUE))
remotes::install_github('saezlab/decoupleR')
if (!requireNamespace("ComplexHeatmap", quietly = TRUE))
BiocManager::install("ComplexHeatmap")
if (!requireNamespace("glue", quietly = TRUE))
install.packages("glue")
if (!requireNamespace("colorBlindness", quietly = TRUE))
install.packages("colorBlindness")
if (!requireNamespace("ggpmisc", quietly = TRUE))
install.packages("ggpmisc")
if (!requireNamespace("circlize", quietly = TRUE))
install.packages("circlize")
library(Seurat)
library(tidyverse)
library(decoupleR)
library(ComplexHeatmap)
library(colorBlindness)
library(ggpmisc)
# Remember to set a seed so the analysis is reproducible!
set.seed(687)9 - Transcription Factor Activity - How to score & interpret them
Introduction
scRNA-seq data returns individual molecular reads for each cell representing the expression of each gene in each cell. However, transcript abundances at the individual gene level can be hard to interpret. Another confounding factor with these readouts is the high sparsity of the data. This sparsity acutely affects genes with low mRNA abundance (Figure 1) Mereu et al. (2020) . Transcription factors (TFs) are key players involved in regulating the present and future cell states by binding to regulatory regions in the DNA and driving gene expression programs Baskar et al. (2022). Therefore, they are tightly regulated and are often found at low abundances due to their powerful effects on the cells. Hence, being able to quantify the activity of TFs in a cell can provide very valuable information when characterizing the biological processes underlying a cell type or state. However, due to their low expression they severely suffer from dropout events and their mRNA abundance can’t be accurately quantified by looking at the number of UMIs. To address this issue methods have been developed to quantify their activities by leveraging the expression of the genes they regulate.
Figure 1. Dropout Probability vs Expression level
In this vignette we are going to go over the best practices on how to compute these activites, what we need to take into account when computing them and follow a recent benchmarking paper to determine which are the best reference databases Müller-Dott et al. (2023) and tools to use like decoupler Badia-i-Mompel et al. (2022)!
Before we start here are some key concepts that will help us and frame the vignette!
What is a transcription factor?
Transcription factors are broadly understood as proteins that bind to regulatory regions of the DNA acting as key regulators of gene-expression programs Baskar et al. (2022).
What information do we need to compute the activity of a TF?
The activity of TF is scored based on the expression of the genes it regulates. Therefore, we need a database that contains which genes are regulated by each transcription factor and the relation between them. Some TF can activate the expression of some genes and repress the expression of others. There are many databases that contain this information and in this vignette we aim to provide the current state of the art databases to use, this can be considered as a gene regulatory network (GRN).
Do transcription factors act the same in all cell types?
No! This is crucial to keep in mind when interpreting TF activities. If we take as an example Blimp1 (PRDM1) a well characterized TF in B and T cells it has been shown to have very different functions. In B cells, Blimp1 drives plasmablast formation and antibody secretion, whereas in T cells, Blimp1 regulates functional differentiation, including cytokine gene expression. Studies have determined both conserved and unique functions of Blimp1 in different immune cell subsets such as the unique direct activation of the igh gene transcription in B cells and a conserved antagonism with BCL6 in B cells, T cells, and myeloid cells Nadeau and Martins (2022). This is important to consider because ideally we would have gene-specific GRNs. These can be obtained from multiome datasets but most of the time we don’t have this information, hence reference GRNs are available for these cases and this is what is going to be covered in this vignette.
How do we score them in our dataset?
There are many ways to score the activation of TFs as shown in the
decouplerpaper Badia-i-Mompel et al. (2022). However, they do not all perform the same and it is important to select a robust method. The suggested method after their benchmarking analysis is running a Univariate Linear Model (ULM) where the gene expression values are the response variable and the regulator weights in the gene signature are the explanatory one (don’t worry, we’ll go through this in more detail in a second). The obtained t-value from the fitted model is the activity score of that gene signature in that cell.How do we interpret the activity obtained?
Scoring gene signatures using Univariate Linear Models and using the resulting t-value as the scoring metric allows us to simultaneously interpret in a single statistic the direction of activity (either + or -) and its significance (the magnitude of the score).
Can we further interrogate the activity scores obtained?
Yes! In fact it is very important to look past the score obtained by a cell and into which are the genes driving that activity. Sometimes with TF regulating many genes downstream it could be that just a few genes are contributing to its activity in our dataset. Therefore, if we just stopped at the activity score we could be mislead into thinking that all of the genes downstream of the TF are important when it , usually, is actually only a fraction of them. Moreover, heterogeneous gene expression between two populations can also lead to 2 cells or populations having similar scores for one TF but vastly different genes gene programs underlying them.
Libraries
Load the libraries and install the packages needed to run this notebook
Load data
We’re going to be working with a dataset from the paper - Immunophenotyping of COVID-19 and influenza highlights the role of type I interferons in development of severe COVID-19 Download data from cellxgene portal.
# Download the data in data/ directory
# download.file(
# url = "https://datasets.cellxgene.cziscience.com/d8e35450-de43-451a-9979-276eac688bce.rds",
# destfile = "../data/workshop-data.rds",
# method = "wget",
# extra = "-r -p --random-wait")
# We can also use the CLI with the wget command below
# wget https://datasets.cellxgene.cziscience.com/d8e35450-de43-451a-9979-276eac688bce.rds
se <- readRDS("../data/workshop-data.rds")
# Remove Uncategorized
se <- se[, ! se$Celltype %in% c("Uncategorized1", "Uncategorized2")]
se$Celltype <- as.character(se$Celltype)
# Subset for the sake of speed
se <- se[, sample(colnames(se), 0.25 * ncol(se))]Generate a color palette for our cell types
# https://www.datanovia.com/en/blog/easy-way-to-expand-color-palettes-in-r/
nb.cols <- length(unique(se$Celltype))
pal <- colorRampPalette(paletteMartin)(nb.cols)
names(pal) <- sort(unique(se$Celltype))Analysis
Convert ENSEMBL IDs to Gene Symbols
Right away we can see how ensembl ids are used in the rownames. Let’s transform them into their matched symbols to make them human-readable:
head(rownames(se))[1] "ENSG00000000003" "ENSG00000000005" "ENSG00000000419" "ENSG00000000457" "ENSG00000000460" "ENSG00000000938"
Convert to gene symbols
gene_df <- readr::read_csv(file = "../data/cov_flu_gene_names_table.csv")
symbol_id <- data.frame(index = rownames(se)) %>%
left_join(gene_df, by = "index") %>%
pull(feature_name)
# re-create seurat object
mtx <- se@assays$RNA$data
rownames(mtx) <- symbol_id
se <- CreateSeuratObject(counts = mtx, meta.data = se@meta.data)Preprocessing
We will do a quick preprocessing of the data. 1) log-normalize, 2) identify highly variable genes, 3) scale their expression and 4) compute PCA on the scaled data.
se <- se %>%
NormalizeData(verbose = FALSE) %>%
FindVariableFeatures(nfeatures = 3000, verbose = FALSE) %>%
ScaleData(verbose = FALSE) %>%
RunPCA(verbose = FALSE)Next we check the elbow plot to determine the number of PCs to use for the downstream analysis and then compute UMAP, K-nearest neighbor graph (KNN graph) and run Louvain clustering on it.
# Look at elbow plot to assess the number of PCs to use
ElbowPlot(se)We can see a clear elbow at 10 PCs, we’re going to extend it a bit more and use 15 PCs for the downstream analysis to make sure we are not loosing any biological signal
se <- RunUMAP(se, reduction = "pca", dims = 1:30, verbose = FALSE)Next we compute the K-nearest-neighbor graph
# se <- se %>%
# FindNeighbors(verbose = FALSE) %>%
# FindClusters(resolution = c(0.05, 0.1, 0.125, 0.15, 0.2), verbose = FALSE)
#
# # Visualize the clustering on the UMAP
# DimPlot(
# se,
# group.by = c(
# "RNA_snn_res.0.05", "RNA_snn_res.0.1", "RNA_snn_res.0.125",
# "RNA_snn_res.0.15", "RNA_snn_res.0.2"))For the purpose of this tutorial we are going to proceed with resolution 0.15
(dim_plt <- DimPlot(
se,
group.by = "Celltype") +
scale_color_manual(values = pal))TF activity scoring
As mentioned in the introduction we need a GRN specifying the relation between a TF and its downstream genes. A recent benchmark has shown how CollecTRI is the best database to estimate TF activities Müller-Dott et al. (2023). CollecTRI contains regulons of signed TF-target genes that have been compiled from 12 different sources, provides increased TF coverage, and in a recent benchmark showed a superior performance in indentifying TF perturbations from gene expression data Müller-Dott et al. (2023). You can see the full explanation in the CollecTRI gihtub page from the Saez lab. We will also follow the decoupler vignette on how to download and score these signatures.
net <- get_collectri(organism = "human", split_complexes = FALSE)
net# A tibble: 43,178 × 3
source target mor
<chr> <chr> <dbl>
1 MYC TERT 1
2 SPI1 BGLAP 1
3 SMAD3 JUN 1
4 SMAD4 JUN 1
5 STAT5A IL2 1
6 STAT5B IL2 1
7 RELA FAS 1
8 WT1 NR0B1 1
9 NR0B2 CASP1 1
10 SP1 ALDOA 1
# ℹ 43,168 more rows
Note we are using organism = "human" since we are working with human data - mouse and rat are also available and split_complexes=False since we want to ensure that TF that regulate complexes downstream are scored when the multiple subunits in the complex are present.
In the net dataframe we have the regulation relation between each TF (source) and their downstream genes
# Look at the targets of MYC
net %>% dplyr::filter(source == "MYC")# A tibble: 888 × 3
source target mor
<chr> <chr> <dbl>
1 MYC TERT 1
2 MYC EPO 1
3 MYC ENO1 1
4 MYC CDC25A 1
5 MYC EMP1 1
6 MYC CXCR4 1
7 MYC CDKN1A -1
8 MYC TP53 1
9 MYC ACTB 1
10 MYC BRCA1 1
# ℹ 878 more rows
# Look at the weights of MYC
Hmisc::describe(net$mor)net$mor
n missing distinct Info Mean Gmd
43178 0 2 0.35 0.7306 0.4662
Value -1 1
Frequency 5816 37362
Proportion 0.135 0.865
From here we can gather that in this database MYC is regulating 886 genes-gene complexes. The weight indicates if MYC activates or represses the activity of a genes. Target genes with values > 0 indicate that MYC promotes the expression and viceversa in those with a weight < 1.
It is important to note that here we are using CollecTRI as a reference database. Therefore, some TF may be missing or the mode of regulation may be different from the expected in a cell type of interest. If you have inhouse curated GRNs obtained from inference methods such as CellOracle, pySCENIC or SCENIC+ you can plug them in here. These have the advantage that you can compute GRNs for your cell types and in turn use them in your scRNAseq dataset!
Activity inference with univariate linear model (ULM)
“Univariate Linear Model (ULM) fits a linear model for each sample and regulator, where the observed molecular readouts in mat are the response variable and the regulator weights in net are the explanatory one. Target features with no associated weight are set to zero. The obtained t-value from the fitted model is the activity ulm of a given regulator.”
Figure 2. Univariate linear model method graphical representation.
Moreover, a nice thing about ulm is that in a single statistic it provides the direction of activity (either + or -) and its significance (the magnitude of the score). Making the scores very easy to interpret!
So lets compute the activity scores for every cell in our dataset!
ulm_start <- Sys.time()
res <- decoupleR::run_ulm(
mat = se@assays$RNA$data,
network = net,
.source = "source",
.target = "target",
.mor = "mor")
glue::glue("Time to run ulm is {round(difftime(Sys.time(), ulm_start, units = 'min'), 0)} minutes")Time to run ulm is 6 minutes
Save data
We can see how every cells has a score for every signature!
# Looking at the first 10 entries
res# A tibble: 11,217,279 × 5
statistic source condition score p_value
<chr> <chr> <chr> <dbl> <dbl>
1 ulm MYC AAACCCAAGTAAGGGA-18 27.8 2.40e-168
2 ulm MYC AAACCCACACTGGAAG-16 29.6 2.58e-190
3 ulm MYC AAACCCACAGCTGAAG-1 24.8 3.78e-134
4 ulm MYC AAACCCACATGAGATA-5 21.4 2.49e-101
5 ulm MYC AAACCCAGTCTACAGT-14 23.4 2.27e-120
6 ulm MYC AAACCCAGTCTTCGAA-1 27.9 7.09e-170
7 ulm MYC AAACCCAGTGTTAGCT-3 24.5 2.72e-131
8 ulm MYC AAACCCATCAAGCGTT-15 23.8 2.03e-124
9 ulm MYC AAACCCATCACTGTCC-16 10.8 3.80e- 27
10 ulm MYC AAACCCATCGCGAAGA-16 23.7 1.71e-123
# ℹ 11,217,269 more rows
# Check the cell type for cell AAACCCAAGGGCAATC-1
ct_b <- "AAACCCACAGCTGAAG-1"
ct_nk <- "ACCAAACTCGGTGTTA-5"
se@meta.data[c(ct_b, ct_nk), "Celltype", drop = FALSE] Celltype
AAACCCACAGCTGAAG-1 B cell, IgG-
ACCAAACTCGGTGTTA-5 NK cell
# Looking at all the scores for one specific cell
res %>%
filter(condition %in% c(ct_b, ct_nk)) %>%
arrange(condition, desc(score))# A tibble: 1,542 × 5
statistic source condition score p_value
<chr> <chr> <chr> <dbl> <dbl>
1 ulm MYC AAACCCACAGCTGAAG-1 24.8 3.78e-134
2 ulm RFXAP AAACCCACAGCTGAAG-1 21.7 2.48e-103
3 ulm RFXANK AAACCCACAGCTGAAG-1 20.3 2.81e- 91
4 ulm CIITA AAACCCACAGCTGAAG-1 20.2 2.70e- 90
5 ulm RFX5 AAACCCACAGCTGAAG-1 18.6 1.29e- 76
6 ulm ZBTB4 AAACCCACAGCTGAAG-1 14.2 1.11e- 45
7 ulm ZBED1 AAACCCACAGCTGAAG-1 13.0 1.61e- 38
8 ulm HIVEP2 AAACCCACAGCTGAAG-1 11.3 1.46e- 29
9 ulm RFX1 AAACCCACAGCTGAAG-1 10.2 1.44e- 24
10 ulm NFKB AAACCCACAGCTGAAG-1 8.61 7.69e- 18
# ℹ 1,532 more rows
Visualization
We can directly add the ulm scores to an assay in our object and visualize the results
se[['tfulm']] <- res %>%
pivot_wider(
id_cols = 'source',
names_from = 'condition',
values_from = 'score') %>%
column_to_rownames('source') %>%
Seurat::CreateAssayObject(.)
# Change assay
DefaultAssay(object = se) <- "tfulm"
# Scale the data for comparison across signatures
se <- ScaleData(se)
se@assays$tfulm@data <- se@assays$tfulm@scale.dataUMAP visualization
Plot TF activities on the UMAP embedding
plt <- FeaturePlot(
se,
features = c("PAX5", "EBF1", "EOMES"),
ncol = 2,
slot = "data") &
scale_color_viridis_c(option = "magma")
plt + dim_pltAs mentioned in the introduction, TF are usually very lowly expressed and thus the mRNAs suffer from higher droput rates than more highly expressed genes with more mRNA copies. Using TF activities showcases how we can recover their activity by looking at the genes a TF regulates downstream.
# TF activity scores
DefaultAssay(se) <- "tfulm"
plt1 <- FeaturePlot(
se,
features = c("PAX5", "EBF1", "EOMES"),
ncol = 3,
slot = "data") &
scale_color_viridis_c(option = "magma")
DefaultAssay(se) <- "RNA"
plt2 <- FeaturePlot(
se,
features = c("PAX5", "EBF1", "EOMES"),
ncol = 3,
slot = "data") &
scale_color_viridis_c(option = "magma")
plt1 / plt2We can see how the difference is night and day. In the top row we see the TF activities while in the bottom the normalized gene expression. In all 3 examples (PAX5, EBF1, EOMES) the transcript is rarely detected while the activity is well recapitulated.
Heatmap by cells
We can also visualize the top 25 most highly variable TF scores using a heatmap
n_tfs <- 25
# Extract activities from object as a long dataframe
df <- t(as.matrix(se@assays$tfulm@counts)) %>%
as.data.frame() %>%
mutate(cluster = Idents(se)) %>%
pivot_longer(cols = -cluster, names_to = "source", values_to = "score")
# Get top tfs with more variable means across **cells**
tfs <- df %>%
group_by(source) %>%
summarise(std = sd(score)) %>%
arrange(-abs(std)) %>%
head(n_tfs) %>%
pull(source)
glue::glue("Note that the maximum scaled value is: {round(max(se@assays$tfulm@data[tfs, ]), 2)}, and the minimum is {round(min(se@assays$tfulm@data[tfs, ]), 2)}.")Note that the maximum scaled value is: 10, and the minimum is -11.16.
DoHeatmap(
se,
features = tfs,
group.by = "Celltype",
assay = "tfulm",
disp.max = quantile(se@assays$tfulm@data[tfs, ], .99),
disp.min = quantile(se@assays$tfulm@data[tfs, ], .01),
slot = "data") +
scale_fill_viridis_c(option = "viridis")Heatmap by groups
From the plot above we can see how we have very distinct populations in our datasets. We can also look at it a bit less granular by looking at the mean activity score per cluster.
n_tfs <- 25
# Extract activities from object as a long dataframe
df <- t(as.matrix(se@assays$tfulm@data)) %>%
as.data.frame() %>%
mutate(cluster = se$Celltype) %>%
pivot_longer(cols = -cluster, names_to = "source", values_to = "score") %>%
group_by(cluster, source) %>%
summarise(mean = mean(score))
# Get top tfs with more variable means across **clusters**
tfs <- df %>%
group_by(source) %>%
summarise(std = sd(mean)) %>%
arrange(-abs(std)) %>%
head(n_tfs) %>%
pull(source)
# Transform to wide matrix
top_acts_mat <- df %>%
filter(source %in% tfs) %>%
pivot_wider(id_cols = 'cluster', names_from = 'source',
values_from = 'mean') %>%
column_to_rownames('cluster') %>%
as.matrix()
# Choose color palette
palette_length <- 100
my_color <- colorRampPalette(c("Darkblue", "white","red"))(palette_length)
# Show which is the max and min of the scaled value to make sure we set a scale that makes sense
glue::glue("Note that the maximum scaled value is: {round(max(top_acts_mat), 2)}, and the minimum is {round(min(top_acts_mat), 2)}.")Note that the maximum scaled value is: 5.11, and the minimum is -4.99.
my_breaks <- c(seq(quantile(top_acts_mat, .01), 0, length.out=ceiling(palette_length/2) + 1),
seq(0.05, quantile(top_acts_mat, .99), length.out=floor(palette_length/2)))
# Plot
ComplexHeatmap::pheatmap(top_acts_mat, border_color = NA, color=my_color, breaks = my_breaks)From this approach we see how RBCs are driving tis visualization. When this happens we can follow another approach in which we show the top activated TF per cluster.
# Get top tfs with more variable means across **clusters**
tfs_df <- df %>%
group_by(cluster) %>%
top_n(n = 5, wt = abs(mean))
# We can see how cluster 6 has the highest TF activity scores
# we are going to take this into account when visualizing the data!
tfs_df %>% head()# A tibble: 6 × 3
# Groups: cluster [2]
cluster source mean
<chr> <chr> <dbl>
1 B cell, IgG+ CHD4 -2.22
2 B cell, IgG+ EBF1 2.20
3 B cell, IgG+ ELF2 1.94
4 B cell, IgG+ PAX5 1.67
5 B cell, IgG+ PHF5A 2.68
6 B cell, IgG- CHD4 -2.42
# Extract the TF onto a list
tfs <- tfs_df %>%
pull(source)
# Transform to wide matrix
top_acts_mat <- df %>%
filter(source %in% tfs) %>%
pivot_wider(id_cols = 'cluster', names_from = 'source',
values_from = 'mean') %>%
column_to_rownames('cluster') %>%
as.matrix()
# Choose color palette
palette_length <- 100
my_color <- colorRampPalette(c("Darkblue", "white","red"))(palette_length)
# Show which is the max and min of the scaled value to make sure we set a scale that makes sense
glue::glue("Note that the maximum scaled value is: {round(max(top_acts_mat), 2)}, and the minimum is {round(min(top_acts_mat), 2)}.")Note that the maximum scaled value is: 5.11, and the minimum is -4.99.
my_breaks <- c(seq(quantile(top_acts_mat, .01), 0, length.out=ceiling(palette_length/2) + 1),
seq(0.05, quantile(top_acts_mat, .99), length.out=floor(palette_length/2)))
# Plot
ComplexHeatmap::pheatmap(top_acts_mat, border_color = NA, color=my_color, breaks = my_breaks)We can clearly assess that the transcription factors returned are cell type specific since B cells cluster together, and T cells and monocytes as well.
Heatmap by condition & cell type
Lastly we want to check if there are differences in the TF activation depending on cell type and disease
meta_sub <- se@meta.data %>%
rownames_to_column("bc") %>%
dplyr::select(bc, Celltype, disease)
df <- t(as.matrix(se@assays$tfulm@data)) %>%
as.data.frame() %>%
rownames_to_column("bc") %>%
left_join(meta_sub, by = "bc") %>%
column_to_rownames("bc") %>%
pivot_longer(cols = -c(Celltype, disease), names_to = "source", values_to = "score") %>%
group_by(Celltype, source, disease) %>%
summarise(mean = mean(score)) %>%
mutate(disease_celltype = glue::glue("{disease} - {Celltype}"))
# Get top tfs with more variable means across **clusters**
tfs_df <- df %>%
group_by(disease_celltype) %>%
top_n(n = 10, wt = abs(mean))
# We can see how cluster 6 has the highest TF activity scores
# we are going to take this into account when visualizing the data!
# tfs_df %>% data.frame()
# Extract the TF onto a list
tfs <- tfs_df %>%
pull(source)
# Transform to wide matrix
top_acts_mat <- df %>%
filter(source %in% tfs) %>%
pivot_wider(id_cols = 'disease_celltype', names_from = 'source',
values_from = 'mean') %>%
column_to_rownames('disease_celltype') %>%
as.matrix()
# Choose color palette
palette_length <- 100
my_color <- colorRampPalette(c("Darkblue", "white","red"))(palette_length)
# Show which is the max and min of the scaled value to make sure we set a scale that makes sense
glue::glue("Note that the maximum scaled value is: {round(max(top_acts_mat), 2)}, and the minimum is {round(min(top_acts_mat), 2)}.")Note that the maximum scaled value is: 6.58, and the minimum is -6.27.
my_breaks <- c(seq(quantile(top_acts_mat, .01), 0, length.out=ceiling(palette_length/2) + 1),
seq(0.05, quantile(top_acts_mat, .99), length.out=floor(palette_length/2)))
# Plot
ComplexHeatmap::pheatmap(top_acts_mat, border_color = NA, color=my_color, breaks = my_breaks)This heatmap can be overwhelming, we can take a look at specific populations of interest, in this case monocytes and highlight the most highly variable ones:
n_tfs <- 25
# Get top tfs with more variable means across **clusters**
tfs <- df %>%
# Only keep monocytes
filter(str_detect(Celltype, "Monocyte")) %>%
group_by(source) %>%
summarise(std = sd(mean)) %>%
arrange(-abs(std)) %>%
head(n_tfs) %>%
pull(source)
# Transform to wide matrix
top_acts_mat <- df %>%
filter(source %in% tfs) %>%
filter(str_detect(Celltype, "Monocyte")) %>%
pivot_wider(id_cols = 'disease_celltype', names_from = 'source',
values_from = 'mean') %>%
column_to_rownames('disease_celltype') %>%
as.matrix()
# Choose color palette
palette_length <- 100
my_color <- colorRampPalette(c("Darkblue", "white","red"))(palette_length)
# Show which is the max and min of the scaled value to make sure we set a scale that makes sense
glue::glue("Note that the maximum scaled value is: {round(max(top_acts_mat), 2)}, and the minimum is {round(min(top_acts_mat), 2)}.")Note that the maximum scaled value is: 2.03, and the minimum is -1.63.
my_breaks <- c(seq(quantile(top_acts_mat, .01), 0, length.out=ceiling(palette_length/2) + 1),
seq(0.05, quantile(top_acts_mat, .99), length.out=floor(palette_length/2)))
# Plot
ComplexHeatmap::pheatmap(top_acts_mat, border_color = NA, color=my_color, breaks = my_breaks)In this last scenario we can see how the cell types cluster by major cell type, not by condition. However, in intermediate monocytes there seems to be an enrichment in interferon response genes in those from influenza or covid patients.
Heatmap for gene expression
To fully grasp which genes are driving each gene signature within each cell we want to visualize the gene expression of the genes involved in each gene signature for each cell. We can do so using the ComplexHeatmap package and a little bit of data processing. For ease here is a function you can incorporate in your analysis:
geneHM <- function(
object,
sig_df,
sig_name,
sig_assay,
.source,
.target,
sig_slot = "data",
expr_assay = "RNA",
expr_slot = "data",
grouping = NULL,
grouping_color = NULL,
expr_cols = viridisLite::magma(100)) {
# Extract Gene Expression Matrix from Seurat Object
gene_expr <- GetAssayData(object, assay = expr_assay, slot = expr_slot)
# Subset the genes of the signature from the Gene Expression Matrix
genes_of_interest <- sig_df[, .target][which(sig_df[, .source] %in% sig_name)]
# Subset the genes intersecting between gene expression and genes in signature
g_int <- intersect(rownames(gene_expr), genes_of_interest)
if (length(g_int) < length(genes_of_interest)) {
genes_excluded <- genes_of_interest[!genes_of_interest %in% rownames(gene_expr)]
genes_excluded <- paste(genes_excluded, collapse = ", ")
message(paste0(
"Genes ", genes_excluded,
" are in the gene signature but not in the expression matrix,",
" therefore, they have been excluded."))
}
# Subset expression matrix to only genes of interest
gene_expr <- gene_expr[g_int, ]
# Extract the Scores of the Signature of interest
sig_score <- GetAssayData(object, assay = sig_assay, slot = sig_slot)
sig_vec <- sig_score[sig_name, ]
anno <- data.frame(score = sig_vec)
# Make sure they are in the right order
anno <- anno[colnames(gene_expr), , drop = FALSE]
# Add any metadata if specified
if (!is.null(grouping)) {
meta <- object@meta.data[, grouping, drop = FALSE]
anno <- cbind(anno, meta[rownames(anno), , drop = FALSE])
}
if (any(is.infinite(c(anno$score))))
stop("There are scores with Inf values, please address this outside of this function. It could be because the slot used is scale_data.")
# Make list of color to paint the annotation columns
if (!is.null(grouping) & !is.null(grouping_color)) {
score <- circlize::colorRamp2(
breaks = c(min(anno$score), 0, max(anno$score)),
colors = c("blue", "white", "red"))
color_ls <- append(grouping_color, score)
names(color_ls)[length(color_ls)] <- "score"
} else {
color_ls <- list(
score = circlize::colorRamp2
(breaks = c(min(anno$score), 0, max(anno$score)),
colors = c("blue", "white", "red")),
RNA_snn_res.0.15 = clust_color)
}
# Set the order from most expressing to least expressing
ord <- rownames(anno[order(anno$score, decreasing = TRUE), ])
# Add the score of the signature as annotation in the heatmap
colAnn <- HeatmapAnnotation(
df = anno[ord, , drop = FALSE],
which = 'column',
col = color_ls)
# Visualize the Heatmap with the genes and signature
ht <- ComplexHeatmap::Heatmap(
as.matrix(gene_expr[, ord]),
name = "Gene Expression",
col = expr_cols,
cluster_rows = TRUE,
cluster_columns = TRUE,
column_title = sig_name,
column_names_gp = gpar(fontsize = 14),
show_column_names = FALSE,
top_annotation = colAnn)
# Return ComplexHeatmap
draw(ht)
}
pal B cell, IgG- B cell, IgG+ CD4, EM-like CD4, non-EM-like CD8, EM-like CD8, non-EM-like classical Monocyte DC intermediate Monocyte NK cell nonclassical Monocyte Platelet RBC
"#000000" "#005555" "#55859E" "#FF91C8" "#853CAA" "#0C5ACE" "#B66DFF" "#79BCFF" "#AA91A9" "#922400" "#C26000" "#42E61E" "#FFFF6D"
# Visualize the heatmaps for all signatures
# tt <- lapply(unique(sig_df$signature), function(i) {
# geneHM(
# object = se,
# sig_df = sig_df,
# .source = "signature",
# .target = "gene",
# sig_name = i,
# expr_slot = "data",
# expr_assay = "RNA",
# sig_assay = "pathwaysulm",
# sig_slot = "data",
# grouping = c("RNA_snn_res.0.15"),
# grouping_color = list(RNA_snn_res.0.15 = clust_color))
# })Here are some examples of how to interpret these gene signatures:
- Diving deeper into the EBF1 signature we can see how the cells with a high activity score have a high expression of CD79A/B which have a mode of regulation of 1. In turn, those with a negative score have a high expression of GATA3 whose mor is -1, thus indicating that the EBF1 pathway is turned off.
# Subset to 10% of the original dataset for speed and visualization purposes
se_sub <- se[, sample(colnames(se), 0.1 * ncol(se))]
geneHM(
object = se_sub,
sig_df = data.frame(net),
.source = "source",
.target = "target",
sig_name = "EBF1",
expr_slot = "data",
expr_assay = "RNA",
sig_assay = "tfulm",
sig_slot = "data",
grouping = c("Celltype"),
grouping_color = list(Celltype = pal))net %>% dplyr::filter(source == "EBF1" & target %in% c("CD79A", "CD79B", "GATA3"))# A tibble: 3 × 3
source target mor
<chr> <chr> <dbl>
1 EBF1 CD79A 1
2 EBF1 GATA3 -1
3 EBF1 CD79B 1
Briefly, when scoring TF activities and looking at their activities it is important to not only look at the overall score obtained for each cell but one also needs to dive deeper into which are the genes that are driving that signature!
TF by disease
Lastly we can also look at the TF activity per cell type by condition of interest as below:
disease_pal <- c("#41AE76", "#225EA8", "#E31A1C")
names(disease_pal) <- c("normal", "influenza", "COVID-19")
# make a donor palette
donor_pal <- c(
"#66C2A4", "#41AE76", "#238B45", "#006D2C",
"#41B6C4", "#1D91C0", "#225EA8", "#253494", "#081D58",
"#FFEDA0", "#FED976", "#FEB24C", "#FD8D3C",
"#FC4E2A", "#E31A1C", "#BD0026", "#800026")
names(donor_pal) <- c(
"Normal 1", "Normal 2", "Normal 3", "Normal 4",
"Flu 1", "Flu 2", "Flu 3", "Flu 4", "Flu 5",
"nCoV 1", "nCoV 2", "nCoV 3_4", "nCoV 5",
"nCoV 6", "nCoV 7_8", "nCoV 9_10", "nCoV 11"
)
dd <- bind_cols(se@meta.data, t(as.matrix(se@assays$tfulm@counts))) %>%
tidyr::pivot_longer(
cols = rownames(se@assays$tfulm@counts),
names_to = "TF",
values_to = "score") %>%
group_by(Celltype, disease, donor_id, TF) %>%
summarise(median_score = median(score))
# Let's look at some TF of interest
lapply(c("IRF1", "IRF2", "IRF6", "IRF9"), function(i) {
dd %>%
filter(TF == i) %>%
ggplot(aes(x = disease, y = median_score, fill = disease)) +
geom_boxplot(outlier.shape = NA, alpha = 0.7) +
geom_jitter(aes(color = donor_id), height = 0) +
facet_wrap(~ Celltype) +
scale_fill_manual(values = disease_pal) +
scale_color_manual(values = donor_pal) +
scale_x_discrete(limits = names(disease_pal)) +
theme_classic() +
labs(title = glue::glue("Factors by condition in {i}"))
})[[1]]
[[2]]
[[3]]
[[4]]
Extra!
How does a univariate linear model work?
Lets start with a toy example. Imagine a very simple scenario where we have two very simple vectors where one is double the other. We can compute the linear model and also easily visualize the relationship between both vectors:
# Define vectors of interest
vec1 <- c(-1, -0.5, 1, 2, 5)
vec2 <- c(-0.8, -0.7, 1.4, 1.8, 4)
# Run the linear model
summary(lm(vec2 ~ vec1))
Call:
lm(formula = vec2 ~ vec1)
Residuals:
1 2 3 4 5
-0.04956 -0.36053 0.50658 0.08465 -0.18114
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.07149 0.19800 0.361 0.74197
vec1 0.82193 0.07920 10.378 0.00191 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3782 on 3 degrees of freedom
Multiple R-squared: 0.9729, Adjusted R-squared: 0.9639
F-statistic: 107.7 on 1 and 3 DF, p-value: 0.001909
# Visualize the data
(p <- ggplot(mapping = aes(x = vec1, y = vec2)) +
geom_point() +
geom_abline(slope = 1, color = "red", linetype = "dashed") + # Passing a slope = 1
geom_hline(yintercept = 0, size = 0.25) +
geom_vline(xintercept = 0, size = 0.25) +
coord_fixed() +
theme_minimal())# now we can add the slope of the line that best fits our data and the T value
p +
stat_poly_line(formula = y ~ x, se = FALSE) +
stat_poly_eq(use_label(c("eq"))) +
stat_correlation(use_label(c("t")), label.x = 0.05, label.y = 0.9)In the example above we see the linear relationship between both vectors and we get the slope and the T value:
- The slope indicates the what is the change in the response variable (vec2) given a 1 unit change in the predictor variable (vec1).
- The T statistic is the result of a T test. The T test assesses the significance of individual coefficients in our model. The T value indicates the number of standard errors the estimated coefficient is away from the null hypothesis (t = 0). Remember the T value is the \(\frac{coefficient}{standard~error}\).
An example of an activated TF activity score would look like this:
# Define vectors of interest
gex <- c(0, 0, 2.5, 0.65, 1.2)
mor <- c(-1, -1, 1, 1, 1)
# Run the linear model
summary(lm(gex ~ mor))
Call:
lm(formula = gex ~ mor)
Residuals:
1 2 3 4 5
1.884e-16 -1.140e-16 1.050e+00 -8.000e-01 -2.500e-01
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.725 0.354 2.048 0.133
mor 0.725 0.354 2.048 0.133
Residual standard error: 0.7757 on 3 degrees of freedom
Multiple R-squared: 0.5829, Adjusted R-squared: 0.4439
F-statistic: 4.193 on 1 and 3 DF, p-value: 0.133
# Visualize the data
(p <- ggplot(mapping = aes(x = gex, y = mor)) +
geom_jitter(width = 0, height = 0.1) +
geom_abline(slope = 1, color = "red", linetype = "dashed") + # Passing a slope = 1
geom_hline(yintercept = 0, size = 0.25) +
geom_vline(xintercept = 0, size = 0.25) +
coord_fixed() +
theme_minimal())# now we can add the slope of the line that best fits our data and the T value
p +
stat_poly_line(formula = y ~ x, se = FALSE) +
stat_poly_eq(use_label(c("eq"))) +
stat_correlation(use_label(c("t")), label.x = 0.05, label.y = 0.9)Here we see how genes with an mor of -1 (the TF represses the expression of that gene) are not expressed (gex = 0), while those with and mor = 1 (the TF promotes the expression of that gene) are expressed!
Now lets look at a “real world” example, we want to score the B cell signature in one cell. First we are going to start by visualizing the relationship between the weights and the gene expression for 2 cells of interest, one is a B cell and the other is not.
We need to do a bit of data prep but bear with me!
# We have our gene expression matrix
mat <- se@assays$RNA$data
# We want to obtain a matrix with 1s and 0s indicating the weight each gene has for each signature
## Initialize mor_mat with all 0s
sources <- unique(net$source)
targets <- rownames(mat)
mor_mat <- matrix(0, ncol = length(sources), nrow = nrow(mat))
colnames(mor_mat) <- sources
rownames(mor_mat) <- targets
weights <- net$mor
# Fill in the matrix with the weights in the right places
for (i in seq_len(nrow(net))) {
.source <- net$source[[i]]
.target <- net$target[[i]]
.weight <- weights[[i]]
if (.target %in% targets) {
mor_mat[[.target, .source]] <- .weight
}
}# labels for geom_text_repel
repel_text <- rownames(mat)
pax5_targets <- net %>% filter(source == "PAX5") %>% pull(target)
pax5_targets <- pax5_targets[pax5_targets %in% rownames(mat)]
keep <- which(rownames(mat) %in% pax5_targets)
# Set non-selected positions to NA
repel_text[-keep] <- NA
# Visualize the data
# Note that we are using geom_bin2d which bins the data to show how many genes are found at each location (x, y)
ct_b <- "AAACCCACAGCTGAAG-1"
ct_nk <- "ACCAAACTCGGTGTTA-5"
ggplot(mapping = aes(x = mat[, ct_b], y = mor_mat[, "PAX5", drop = FALSE])) +
# geom_bin2d(bins = 50) +
geom_point() +
ggrepel::geom_text_repel(aes(label = repel_text)) +
geom_abline(slope = 1, color = "red", linetype = "dashed") + # Passing a slope = 1
labs(x = "Gene Expression", y = "Gene Weight", title = glue::glue("Cell {ct_b} - B cell")) +
coord_fixed() +
scale_fill_gradient(name = "count", trans = "log10") +
xlim(-0.5, 10) +
ylim(-2, 2) + # Set the limits from -2 to 2 since the mor values can be negative
theme_minimal() +
# now we can add the slope of the line that best fits our data and the T value
stat_poly_line(formula = x ~ y, se = FALSE) +
stat_poly_eq(use_label(c("eq"))) +
stat_correlation(use_label(c("t")), label.x = 0.05, label.y = 0.9)# Visualize the data
ggplot(mapping = aes(x = mat[, ct_nk], y = mor_mat[, "PAX5", drop = FALSE])) +
geom_point() +
ggrepel::geom_text_repel(aes(label = repel_text)) +
geom_abline(slope = 1, color = "red", linetype = "dashed") + # Passing a slope = 1
labs(x = "Gene Expression", y = "Gene Weight", title = glue::glue("Cell {ct_nk} - Not a B cell")) +
coord_fixed() +
scale_fill_gradient(name = "count", trans = "log10") +
xlim(-0.5, 10) +
ylim(-2, 2) +
theme_minimal() +
# now we can add the slope of the line that best fits our data and the T value
stat_poly_line(formula = x ~ y, se = FALSE) +
stat_poly_eq(use_label(c("eq"))) +
stat_correlation(use_label(c("t")), label.x = 0.05, label.y = 0.9)Next we are going to manually run the models for these two cells so that we can see that the results obtained from decoupleR make sense!
Lets run the a linear model to score two cell for the B cell signature
mod1 <- lm(as.matrix(mat[, ct_b, drop = FALSE]) ~ mor_mat[, "PAX5", drop = FALSE])
summary(mod1)
Call:
lm(formula = as.matrix(mat[, ct_b, drop = FALSE]) ~ mor_mat[,
"PAX5", drop = FALSE])
Residuals:
Min 1Q Median 3Q Max
-0.3038 -0.0581 -0.0581 -0.0581 6.6019
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.058139 0.001866 31.158 < 2e-16 ***
mor_mat[, "PAX5", drop = FALSE] 0.245627 0.041251 5.954 2.64e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.34 on 33232 degrees of freedom
Multiple R-squared: 0.001066, Adjusted R-squared: 0.001036
F-statistic: 35.45 on 1 and 33232 DF, p-value: 2.636e-09
0.245627 / 0.041251 # This equals the T value[1] 5.95445
mod2 <- lm(as.matrix(mat[, ct_nk, drop = FALSE]) ~ mor_mat[, "PAX5", drop = FALSE])
summary(mod2)
Call:
lm(formula = as.matrix(mat[, ct_nk, drop = FALSE]) ~ mor_mat[,
"PAX5", drop = FALSE])
Residuals:
Min 1Q Median 3Q Max
-0.0890 -0.0821 -0.0821 -0.0821 6.2809
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.082097 0.002016 40.724 <2e-16 ***
mor_mat[, "PAX5", drop = FALSE] -0.006913 0.044568 -0.155 0.877
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3673 on 33232 degrees of freedom
Multiple R-squared: 7.241e-07, Adjusted R-squared: -2.937e-05
F-statistic: 0.02406 on 1 and 33232 DF, p-value: 0.8767
0.013364 / 0.043381 # This equals the T value[1] 0.3080611
We can see how mod1 has returned a high coefficient for cell AAACCCACAGCTGAAG-1 while mod2 has returned a low coefficient for cell ACCAAACTCGGTGTTA-5. Moreover, when we look at the T value for the PAX5 we see how in mod1 it is 5.95 while for mod2 it is 0.3080611.
Lets check that the T values obtained manually actually match those returned by decoupleR
res %>% filter(source == "PAX5" & condition %in% c(ct_b, ct_nk))# A tibble: 2 × 5
statistic source condition score p_value
<chr> <chr> <chr> <dbl> <dbl>
1 ulm PAX5 AAACCCACAGCTGAAG-1 5.95 0.00000000264
2 ulm PAX5 ACCAAACTCGGTGTTA-5 -0.155 0.877
Effectively, they do!
Session Info
sessionInfo()R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.4
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/New_York
tzcode source: internal
attached base packages:
[1] grid stats graphics grDevices utils datasets methods base
other attached packages:
[1] ggpmisc_0.5.4-1 ggpp_0.5.4 colorBlindness_0.1.9 ComplexHeatmap_2.16.0 decoupleR_2.9.1 lubridate_1.9.2 forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4 purrr_1.0.2 readr_2.1.4 tidyr_1.3.1 tibble_3.2.1 ggplot2_3.5.0 tidyverse_2.0.0 Seurat_5.0.2 SeuratObject_5.0.1 sp_2.1-3
loaded via a namespace (and not attached):
[1] RcppAnnoy_0.0.22 splines_4.3.1 later_1.3.2 cellranger_1.1.0 polyclip_1.10-6 confintr_1.0.2 rpart_4.1.19 fastDummies_1.7.3 lifecycle_1.0.4 doParallel_1.0.17 globals_0.16.2 lattice_0.21-8 vroom_1.6.3 MASS_7.3-60 backports_1.4.1 magrittr_2.0.3 Hmisc_5.1-0 plotly_4.10.4 rmarkdown_2.25 yaml_2.3.8 remotes_2.4.2.1 httpuv_1.6.14 sctransform_0.4.1 spam_2.10-0 spatstat.sparse_3.0-3 reticulate_1.35.0.9000 cowplot_1.1.3 pbapply_1.7-2 RColorBrewer_1.1-3 abind_1.4-5 rvest_1.0.3 Rtsne_0.17 BiocGenerics_0.46.0 nnet_7.3-19 rappdirs_0.3.3 circlize_0.4.15 IRanges_2.34.1 S4Vectors_0.38.1 ggrepel_0.9.5 irlba_2.3.5.1 listenv_0.9.1 spatstat.utils_3.0-4 MatrixModels_0.5-2 goftest_1.2-3 RSpectra_0.16-1 spatstat.random_3.2-2 fitdistrplus_1.1-11 parallelly_1.37.0 leiden_0.4.3.1 codetools_0.2-19 xml2_1.3.5
[52] tidyselect_1.2.0 shape_1.4.6 farver_2.1.1 base64enc_0.1-3 matrixStats_1.2.0 stats4_4.3.1 spatstat.explore_3.2-6 jsonlite_1.8.8 GetoptLong_1.0.5 Formula_1.2-5 ellipsis_0.3.2 progressr_0.14.0 ggridges_0.5.6 survival_3.5-7 iterators_1.0.14 foreach_1.5.2 progress_1.2.2 tools_4.3.1 ica_1.0-3 Rcpp_1.0.12 glue_1.7.0 gridExtra_2.3 xfun_0.42 withr_3.0.0 BiocManager_1.30.22 fastmap_1.1.1 fansi_1.0.6 SparseM_1.81 digest_0.6.34 timechange_0.2.0 R6_2.5.1 mime_0.12 gridGraphics_0.5-1 colorspace_2.1-0 Cairo_1.6-1 scattermore_1.2 tensor_1.5 spatstat.data_3.0-4 utf8_1.2.4 generics_0.1.3 data.table_1.15.0 prettyunits_1.1.1 httr_1.4.7 htmlwidgets_1.6.4 uwot_0.1.16 pkgconfig_2.0.3 gtable_0.3.4 lmtest_0.9-40 selectr_0.4-2 OmnipathR_3.8.0 htmltools_0.5.7
[103] dotCall64_1.1-1 clue_0.3-64 scales_1.3.0 png_0.1-8 knitr_1.45 rstudioapi_0.15.0 tzdb_0.4.0 reshape2_1.4.4 rjson_0.2.21 curl_5.2.0 checkmate_2.2.0 nlme_3.1-163 zoo_1.8-12 GlobalOptions_0.1.2 KernSmooth_2.23-22 parallel_4.3.1 miniUI_0.1.1.1 foreign_0.8-84 logger_0.2.2 pillar_1.9.0 vctrs_0.6.5 RANN_2.6.1 promises_1.2.1 xtable_1.8-4 cluster_2.1.4 htmlTable_2.4.1 evaluate_0.23 magick_2.7.5 cli_3.6.2 compiler_4.3.1 rlang_1.1.3 crayon_1.5.2 future.apply_1.11.1 labeling_0.4.3 plyr_1.8.9 stringi_1.8.3 viridisLite_0.4.2 deldir_2.0-2 munsell_0.5.0 lazyeval_0.2.2 spatstat.geom_3.2-8 quantreg_5.97 Matrix_1.6-5 RcppHNSW_0.6.0 hms_1.1.3 patchwork_1.2.0 bit64_4.0.5 future_1.33.1 shiny_1.8.0 ROCR_1.0-11 igraph_2.0.2
[154] bit_4.0.5 readxl_1.4.3 polynom_1.4-1